Goto

Collaborating Authors

 multimodal environment


CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World

arXiv.org Artificial Intelligence

Following instructions in real-world conditions requires the ability to adapt to the world's volatility and entanglement: the environment is dynamic and unpredictable, instructions can be linguistically complex with diverse vocabulary, and the number of possible goals an agent may encounter is vast. Despite extensive research in this area, most studies are conducted in static environments with simple instructions and a limited vocabulary, making it difficult to assess agent performance in more diverse and challenging settings. To address this gap, we introduce CrafText, a benchmark for evaluating instruction following in a multimodal environment with diverse instructions and dynamic interactions. CrafText includes 3,924 instructions with 3,423 unique words, covering Localization, Conditional, Building, and Achievement tasks. Additionally, we propose an evaluation protocol that measures an agent's ability to generalize to novel instruction formulations and dynamically evolving task configurations, providing a rigorous test of both linguistic understanding and adaptive decision-making.


RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

arXiv.org Artificial Intelligence

Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. Addressing this, we propose Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents' planning capabilities. RAP distinguishes itself by being versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness, where it achieves SOTA performance in textual scenarios and notably enhances multimodal LLM agents' performance for embodied tasks. These results highlight RAP's potential in advancing the functionality and applicability of LLM agents in complex, real-world applications.


Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

arXiv.org Artificial Intelligence

Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from text input parsing. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when utilized with Weight Sharing across the Transformer encoder. The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling and provide insights for future research.


Facebook's Captum brings explainability to machine learning

#artificialintelligence

Facebook today introduced Captum, a library for explaining decisions made by neural networks with deep learning framework PyTorch. Captum is designed to implement state of the art versions of AI models like Integrated Gradients, DeepLIFT, and Conductance. Captum allows researchers and developers to interpret decisions made in multimodal environments that combine, for example, text, images, and video, and allows them to compare results to existing models within the library. Developers can also use Captum to understand feature importance or perform a deep dive on neural networks to understand neuron and layer attributions. The tool will also launch with Captum Insights, a visualization tool for visual representations of Captum results.